CS 6740: Advanced Language Technologies, February 4, 2010. Lecture 3: Pivoted Document Length Normalization

Authors

  • Lillian Lee
  • Lakshmi Ganesh
  • Navin Sivakumar
Abstract

In this lecture, we examine the impact of the length of a document on its relevance to queries. We show that document relevance is positively correlated with document length, and see that relevance scores that use the normalization techniques we’ve studied so far (L∞, L1, L2) do not capture this correlation correctly. Finally, we present the “pivoted document length normalization” technique introduced by Singhal et al. in [SBM96], which addresses this issue.
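As a rough illustration of the idea in the abstract, the sketch below compares the three standard norms with a pivoted normalizer of the form (1 - slope) * pivot + slope * old_norm, the form commonly attributed to [SBM96]. The toy term-weight vectors, the choice of the average L2 norm as the pivot, and the slope of 0.2 are assumptions made for illustration only; they are not taken from the lecture.

```python
import math


def l1_norm(weights):
    """Sum of absolute term weights."""
    return sum(abs(w) for w in weights)


def l2_norm(weights):
    """Euclidean length of the term-weight vector (cosine normalization)."""
    return math.sqrt(sum(w * w for w in weights))


def linf_norm(weights):
    """Largest absolute term weight."""
    return max(abs(w) for w in weights)


def pivoted_norm(old_norm, pivot, slope):
    """Tilt the old normalizer around the pivot; at old_norm == pivot it is unchanged."""
    return (1.0 - slope) * pivot + slope * old_norm


# Hypothetical term-weight vectors for three documents of increasing length.
docs = {
    "short": [1.0, 1.0],
    "medium": [2.0, 1.0, 2.0],
    "long": [3.0, 4.0, 2.0, 5.0, 1.0],
}

old = {name: l2_norm(w) for name, w in docs.items()}
pivot = sum(old.values()) / len(old)  # assumption: pivot = average old norm
slope = 0.2                           # assumption: illustrative value, should be tuned
new = {name: pivoted_norm(n, pivot, slope) for name, n in old.items()}

for name in docs:
    print(f"{name:>6}: L2 = {old[name]:.3f}, pivoted = {new[name]:.3f}")
```

With a slope below 1, documents whose old norm exceeds the pivot are divided by a smaller normalizer than before, and documents below the pivot by a larger one, which shifts scores in favor of longer documents.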

Similar resources

Document Length Normalization

In the previous lecture we discussed pivoted document length normalization [Singhal et al. 96], a simple technique that applies a correction for the observation that document relevance correlates with document length. Through careful empirical verification of previous assumptions, they showed that the seemingly simple normalization term could have a big impact on results. However, in our discus...

Document Normalization Revisited

Cosine Pivoted Document Length Normalization has reached a point of stability where many researchers indiscriminately apply a specific value of 0.2 regardless of the collection. Our efforts, however, demonstrate that applying this specific value without tuning for the document collection degrades average precision by as much as 20%.

IIT TREC 2005: Genomics Track

For the TREC-2005 Genomics Track ad-hoc retrieval task, we report on the development of a scalable information retrieval engine based on a relational data model for the integration of structured data and text. Our objectives are to meet the need for the integrated search of heterogeneous data sets of biomedical literature and structured data found in biological databases, and to demonstrate the...

CS 674 / INFO 630 : Advanced Language Technologies Fall 2007

At the end of the previous lecture we were talking about how to incorporate implicit relevance feedback which came in the form of preferences, i.e. instead of absolute judgments (this document is relevant and that document is not) we had information from clickthrough data in the form of relative judgments (this document is more relevant than that document). We ended up with some sort of vector ...

CS 674/INFO 630: Advanced Language Technologies, Lecture 7, September 18. 2 Incorporating Term Frequencies

Apart from IDF, term frequencies are also important and we would like to incorporate them into our scoring function. From now on, we will treat Aj as a random variable that denotes the number of occurrences of term j in a document. So, what should P(Aj = a) and P(Aj = a | Rq = y) be? In other words, how do we model the distributions of these random variables? Here we have two options: continuou...

Publication date: 2010